- ======================================================================
- P O O
- doc: Thu Apr 2 11:59:51 1992
- dlm: Wed Jul 21 14:37:58 1993
- (c) 1992 ant@ips.id.ethz.ch
- uE-Info: 266 0 NIL 0 0 72 3 2 8 ofnI
- ======================================================================
-
- This file describes some internals of the inetray package. Its name
- derives from Principles of Operation.
-
- Overview
- --------
- The program inetray is responsible for dispatching and scheduling the
- rayshade requests. In the usual terminology it acts as the client
- requesting services from a number of remotely running servers, using
- Sun RPC. Rendering requests do not block, so inetray also listens
- continuously on a socket for incoming results. Data received from the
- workers is written to the output file whenever possible.
-
- The program rpc.inetrayd serves two purposes. It services a number of
- RPC requests dealing with initialization and management, and whenever
- it receives a rendering request it spawns off a worker child and
- continues to service RPC requests (a restricted number).
- The worker renders a part of a frame and then contacts the dispatcher
- directly to send it the result. This is done using an XDR/TCP
- connection.
-
- inetray.start is a simple RPC daemon servicing requests for starting the
- rpc.inetrayd servers.
-
- Rayshade Libraries
- ------------------
- Inetray uses the standard Rayshade libraries (as used in version 4.0.6).
- Great care has been taken to avoid having to change the libraries at
- all. As long as the interface stays the same no change is required for
- Inetray even if rayshade evolves.
- At times this decision not to change the rayshade source led to
- complicated and maybe clumsy solutions. But in the interest of
- portability it has nevertheless been adhered to strictly.
- There is one currently unsolved problem arising from this decision: it
- seems that the random number generator used for textures behaves
- differently on big- and little-endian machines. Therefore they don't
- mix. A solution has been promised for rayshade 5 by Craig Kolb but so
- far it isn't clear when it'll be out.
-
- Input
- -----
- rayshade accepts its input in various ways:
-
- 1) rayshade "filename" (input is file)
- 2) ... | rayshade (input is stdin/pipe)
- 3) rayshade < "filename" (input is stdin/file)
- 4) rayshade (input is stdin/keyboard)
-
- Inetray from version 1.1.0 on provides total compatibility with all
- those possibilities. It does this by buffering stdin in a live buffer
- (see below). RSInitialize() is then called taking its stdin from this
- buffer. Note that in case 1 the buffer is not needed and should indeed
- disappear because it competes with inetray for the keyboard. This
- problem is solved in a very simple manner: on buffer startup SIGINT is
- set to kill the buffer; once the buffer encounters an eof (i.e. cases 2,
- 3 & 4, where eof is necessarily encountered before RSInitialize()
- returns) SIGINT is ignored. Inetray (the parent) sends a SIGINT to the
- buffer on return from RSInitialize. If the buffer still has the stdin
- open (necessarily the keyboard) it is killed, otherwise it continues
- running.
- Case 1 was the only one allowed for Inetray up to version 1.0.1. It
- requires the input file to exist on all worker-machines. To increase
- the flexibility, the inetray workers try to have the ``same''
- working-directory as the dispatcher (see section Pathnames below).
- Note that even when stdin is used for input, the input can contain
- references to other files to be read in, namely cpp #includes and
- height-fields. Those files must be accessed much in the same way as the
- files in case 1 (see above and section Pathnames). If no such files
- must be included then no file has to be accessible on the worker
- machines.
-
- Case 4 is handled much like cases 2 & 3.
-
- Live Buffers
- ------------
- A live buffer is just a forked process which first reads from one
- filedesc into memory (malloc'ed) and then writes the contents of the
- buffer to another (in some cases two) filedesc before terminating.
- End of input is detected when either an eof is reached or a \0 is read
- as the last character of a read() syscall. This feature allows live
- buffers to be used to read from TCP connections which should not be
- closed.
- Live buffers are more expensive than writing a temporary file for large
- amounts of data and worsen the already problematic memory situation, but
- they avoid having servers write files, which could be a security
- problem (see below).
- Note that live buffers terminate automatically once their respective
- parents have disappeared. This is because they will eventually
- encounter an eof on the input filedesc and start writing. They always
- write to a pipe, so when the last reader of the pipe has died they die
- on a SIGPIPE.
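-
- The end-of-input rule above can be sketched in C roughly as follows.
- This is only an illustration, not the package's code: the function
- name slurp_until_end() and the initial buffer size are assumptions,
- and EINTR handling is left out for brevity.

```c
#include <assert.h>
#include <stdlib.h>
#include <unistd.h>

/* Read from fd into a growing malloc'ed buffer until end of input:
 * either read() returns 0 (eof) or the last byte delivered by a
 * read() is '\0'. Returns the buffer (caller frees) and stores its
 * length in *len; returns NULL on read or allocation failure.
 * A terminating '\0', if any, is kept in the buffer. */
char *slurp_until_end(int fd, size_t *len)
{
    size_t cap = 4096, used = 0;      /* hypothetical starting size */
    char *buf = malloc(cap);

    assert(len != NULL);
    if (buf == NULL)
        return NULL;
    for (;;) {
        ssize_t n;

        if (used == cap) {            /* buffer full: double it */
            char *bigger = realloc(buf, cap * 2);
            if (bigger == NULL) {
                free(buf);
                return NULL;
            }
            buf = bigger;
            cap *= 2;
        }
        n = read(fd, buf + used, cap - used);
        if (n < 0) {
            free(buf);
            return NULL;
        }
        if (n == 0)
            break;                    /* eof */
        used += (size_t)n;
        if (buf[used - 1] == '\0')
            break;                    /* in-band end marker */
    }
    *len = used;
    return buf;
}
```

- The '\0' test is what lets the reader stop without the writer having
- to close a connection that is still needed afterwards.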
-
- Authentication & Security
- -------------------------
- If the servers (rpc.inetrayd) are started as root, they try to change to
- the user id supplied to them. This is usually the user id of the user
- running the dispatcher (inetray). Any user, however, can set a different
- user id for servers started by inetray.start. No server can run as root
- (uid == 0).
- If the uid is illegal on the server it exits with an error message in
- the syslog.
- No server ever produces an output file. This therefore limits the
- security concerns to changing the access time of files. Of course it is
- possible that there are loopholes in this concept; I just haven't found
- one yet.
- If the server is not started as root, it will continue to run under the
- uid it was started as. One has to check the permissions of the accessed
- files for reading access for that user.
- The actual usernames under which the servers are running are displayed
- by both inetray and inetray.ping.
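-
- The uid rules above might look like this in C. It is a sketch under
- assumptions: choose_uid() and drop_to_uid() are hypothetical names,
- and the check that the uid actually exists on the server (which could
- be done with getpwuid()) is omitted.

```c
#include <assert.h>
#include <sys/types.h>
#include <unistd.h>

/* Decide which uid a server should run as.
 * Returns 0 and stores the uid in *run_as, or -1 if it must exit. */
int choose_uid(uid_t current, uid_t requested, uid_t *run_as)
{
    assert(run_as != NULL);
    if (current != 0) {        /* not started as root: keep own uid */
        *run_as = current;
        return 0;
    }
    if (requested == 0)        /* no server may run as root */
        return -1;
    *run_as = requested;
    return 0;
}

/* Apply the decision to the running process. setuid() fails if the
 * change is not permitted. */
int drop_to_uid(uid_t requested)
{
    uid_t target;

    if (choose_uid(getuid(), requested, &target) != 0)
        return -1;
    if (target != getuid() && setuid(target) != 0)
        return -1;
    return 0;
}
```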
-
- Session Keys
- ------------
- Whenever a started server receives its first request, a session key is
- sent with it. Once a session key is installed, only requests with the
- same key are serviced. In practice this means that only the person who
- issued an inetray call can kill the running servers and workers. The key
- is stored in the file .inetray.key in the directory where inetray was
- issued. Any existing file is renamed to .inetray.key.old.
- inetray displays the current session key on startup.
- inetray.ping uses the special key 0. Therefore, if servers hang after an
- inetray.ping, they can be killed with inetray.kill 0.
- The program inetray.kill needs a session key. If one is given as an
- argument, this takes precedence. If no key is supplied, inetray.kill
- looks for one in the file .inetray.key.
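-
- A sketch of both sides of the key handling, in C. The key type
- (unsigned long), the decimal text format of the key file and all
- names are assumptions made for illustration; the real wire and file
- formats are not described here.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Server side: the first request installs its key; later requests
 * are serviced only if they carry the same key. */
struct session {
    int installed;
    unsigned long key;
};

int request_allowed(struct session *s, unsigned long key)
{
    assert(s != NULL);
    if (!s->installed) {
        s->installed = 1;
        s->key = key;
        return 1;
    }
    return s->key == key;
}

/* inetray.kill side: an argument takes precedence over the key file.
 * Returns 0 and stores the key in *key, or -1 if none is available. */
int resolve_key(const char *arg, const char *keyfile, unsigned long *key)
{
    FILE *f;
    int ok;

    assert(key != NULL && keyfile != NULL);
    if (arg != NULL) {
        char *end;

        *key = strtoul(arg, &end, 10);
        return (end != arg && *end == '\0') ? 0 : -1;
    }
    f = fopen(keyfile, "r");
    if (f == NULL)
        return -1;
    ok = (fscanf(f, "%lu", key) == 1);
    fclose(f);
    return ok ? 0 : -1;
}
```

- With this layout, a server whose first request came from inetray.ping
- has key 0 installed, which is why inetray.kill 0 can reach it.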
-
- Version Numbers
- ---------------
- Both inetray and rpc.inetrayd know their own version number. The
- server passes this back to inetray and inetray.ping upon reception of
- the first request. The worker is accepted only if the first character
- (i.e. the major version number) of this version number matches.
-
- Pathnames
- ---------
- Since servers can be running on machines with totally different
- filesystems but may need to access the input files locally, some
- pathname substitution is supported.
- All filenames are transferred as-is to the servers/workers. If they
- start with a / they are absolute pathnames starting at the respective
- root of the machines. This will probably work well only on the most
- homogeneous networks.
- If they don't start with a / they are relative names starting in the
- current working directory. From the working directory where the client
- is started the home-part is stripped if possible. This stripped
- directory is then sent to the server which in turn adds the home
- directory of the uid it is to run as.
- Note that if nothing was stripped on the client-side, then nothing is
- added on the server-side. Note also that the right directory is chosen
- even when the server cannot run under that user id.
- The server tries to chdir to the directory so constructed. If that
- fails it continues to run in the current directory which is the
- directory where it was started from. The working directories of the
- servers are displayed by both inetray and inetray.ping.
- The practical upshot is that if you have the same sub-directory
- structure below your home on the different machines, you can start
- Inetray in any of these directories and the servers/workers will
- chdir() to the right sub-dirs as well.
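-
- The strip-and-add scheme for the working directory could be sketched
- in C as below. The function names and the explicit "stripped" flag are
- assumptions (how the real protocol conveys this is not described
- here), and home directories are assumed to have no trailing slash.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Client side: if cwd lies at or below home, return the part after
 * home (possibly "") and set *stripped; otherwise return cwd. */
const char *strip_home(const char *cwd, const char *home, int *stripped)
{
    size_t n;

    assert(cwd != NULL && home != NULL && stripped != NULL);
    n = strlen(home);
    *stripped = 0;
    if (n > 0 && strncmp(cwd, home, n) == 0 &&
        (cwd[n] == '/' || cwd[n] == '\0')) {
        *stripped = 1;
        return cwd + n;
    }
    return cwd;
}

/* Server side: prepend the server-side home only if the client
 * stripped something. Returns 0, or -1 if out is too small. */
int add_home(const char *dir, int stripped, const char *home,
             char *out, size_t outlen)
{
    int n;

    assert(dir != NULL && home != NULL && out != NULL);
    n = snprintf(out, outlen, "%s%s", stripped ? home : "", dir);
    return (n >= 0 && (size_t)n < outlen) ? 0 : -1;
}
```

- The server would then try chdir() on the result and stay where it is
- if that fails.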
-
- Port Numbers
- ------------
- The rendered portions of a frame are sent back using an XDR/TCP
- connection. The port number for this is defined in config.h (RESULTPORT)
- but can be overridden for each user in the .inetrayrc file.
-
- Registering Servers
- -------------------
- Whenever inetray or inetray.ping is started, it tries to register ready
- servers.
- First, the servers that depend on inetray.start are started; servers
- run from inetd need no such step, since inetd starts them
- automatically when an INIT-request arrives.
- The machines are contacted in the following order:
- 1: All simple hosts given in the Use List (if any)
- 2: All directed broadcast addresses in the Use List (if any)
- 3: The Local Network (if option N=0 is not set in the Use List)
- After starting, an INIT-request is sent to all machines, in the same
- order.
- Servers reply by opening a TCP-connection on the result-port and sending
- back status info.
- Answers may be ignored for two reasons: either the hostname appears
- (exactly as given) in the ignore list in the current .inetrayrc or the
- major version number of the server does not match that of the
- dispatcher.
- If the input comes from stdin, then the contents of the live buffer (see
- above) of the dispatcher are sent to live buffers on the server machines
- using the same TCP-connection. This is, however, only done once
- registering is otherwise completed (i.e. the list of registered machines
- is complete).
-
- Work Scheduling
- ---------------
- A frame is divided into blocks encompassing > 1 lines. This is done
- according to a simple heuristic, the parameters of which can be
- controlled by editing config.h and/or overriding those values in a
- .inetrayrc file (see INSTALL/Appendix B for details).
- After n workers have been registered, the block size is calculated as
- follows: blockSize = ySize / blocksPerServer / n. After that, the size
- is checked against the lower and upper limits (MINBLOCKSIZE and
- MAXBLOCKSIZE respectively). If it exceeds a limit, it is adjusted
- accordingly. After that, the size of the last, possibly incomplete,
- block is calculated and the information printed.
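-
- The block-size calculation can be written down in C as follows. This
- is a sketch: plan_blocks() and struct blocking are hypothetical names,
- integer division is assumed, and the limits are passed as parameters
- instead of being taken from config.h.

```c
#include <assert.h>

struct blocking {
    int block_size;   /* lines per full block */
    int n_blocks;     /* number of blocks in the frame */
    int last_size;    /* lines in the last block */
};

struct blocking plan_blocks(int y_size, int blocks_per_server, int n,
                            int min_size, int max_size)
{
    struct blocking b;
    int rest;

    assert(y_size > 0 && blocks_per_server > 0 && n > 0);
    assert(0 < min_size && min_size <= max_size);

    b.block_size = y_size / blocks_per_server / n;
    if (b.block_size < min_size)      /* MINBLOCKSIZE */
        b.block_size = min_size;
    if (b.block_size > max_size)      /* MAXBLOCKSIZE */
        b.block_size = max_size;

    rest = y_size % b.block_size;     /* lines left for the last block */
    b.n_blocks = y_size / b.block_size + (rest != 0);
    b.last_size = (rest != 0) ? rest : b.block_size;
    return b;
}
```

- For example, a frame of 480 lines with 4 blocks per server and 10
- workers gives 480 / 4 / 10 = 12 lines per block, i.e. 40 full blocks.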
-
- In early versions (up to [0.2.0]), simple round-robin scheduling was
- used: subsequent machines got subsequent blocks to trace; whenever
- the end of a frame was reached, the whole process started over with only
- the non-terminated blocks.
- This could lead to quite bad behaviour towards the end. Consider for
- example the file mole.ray. Early blocks (bottom half) take much longer
- to trace than later ones. If one machine is heavily loaded, it won't
- ever complete its block. This means that one early block will be
- outstanding for a very long time, which will inhibit concurrent writing.
- Furthermore, with a little bit of bad luck, this block will be the last
- one outstanding, which means that a lot of machines will end up
- calculating just this one block, and it will take a long time to
- calculate.
- Starting with version [0.2.1] a rescheduling step is inserted in the
- middle of a frame. The number of machines which have not yet returned a
- result is counted and the first n blocks (n being the number of those
- machines) not yet calculated are given priority over other blocks. These
- blocks are exactly those residing on the slow machines. Hopefully,
- they are redistributed to faster machines this way.
- In my setting, this modification led to quite a decrease in the time
- needed to complete the last block.
- Notes: - The scheme presented here also works nicely if workers crash
- during the first half of a frame (which they seem to tend to
- do).
-
- For version 2.0.0 the scheduling has changed yet again. For images where
- all the hard work is done in a small part of the picture the old
- scheduler didn't work very nicely. To solve this problem the following
- scheduler has been implemented:
- - During the first round of work scheduling (i.e. until all
- blocks have been dispatched once) the blocks are always
- scheduled in pairs (i.e. one worker renders 2 blocks on every
- request).
- - When this 1st pass has been finished, only single blocks are
- dispatched.
- It's not clear if this scheduler is always better than the earlier
- versions.
-
- Concurrent Servers & RPC Program Numbers
- ----------------------------------------
- It is possible for one machine to have more than one server (and worker)
- running at a time. This feature is implemented to allow multiprocessor
- machines to have as many workers as processors running. A machine
- starting more than one worker cannot start them using inetd. Concurrent
- servers have different RPC Program Numbers. The first server gets the
- program number IRNUM defined in prognum.h. Subsequent servers get
- subsequent program numbers.
- This way, registering with the portmapper works correctly. It must be
- noted, though, that all broadcasts to servers must now be repeated for
- every program number.
-
- Error Logging
- -------------
- The general mechanism is described in README and SUPPORT.
- Note that all errors produced by the rayshade routines are also
- logged. This is done using a funny redirection of the stderr to the
- syslog using socketpairs and async I/O. For this to work under AUX I had
- to implement the socketpair() syscall there, since the one built in does
- not work (at least in our version).
-
- Error Termination
- -----------------
- Roughly once every minute, every server checks if the dispatcher is
- still running. If that's not the case, it kills its associated worker
- if it has one and then exits with an entry in the syslog.
- As of version 2.0.0 the server also checks the exit status of its
- child once every minute. If the child exited with a status != 0 it shuts
- itself down. This non-zero exit status can be due to two different
- reasons: either the rayshade libraries exited explicitly or the worker
- was terminated with a signal (either implicitly (bus error, segmentation
- violation, ...) or explicitly (it annoyed either your sysadm or
- yourself)).
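-
- The periodic check of the child could be done along these lines in C.
- This is a sketch built on the standard waitpid() macros; the function
- names are hypothetical and the once-a-minute timer is left out.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Classify a status word as returned by waitpid():
 * 1 if the child exited non-zero or was killed by a signal, else 0. */
int status_is_failure(int status)
{
    if (WIFEXITED(status))
        return WEXITSTATUS(status) != 0;  /* explicit exit() */
    return WIFSIGNALED(status) ? 1 : 0;   /* bus error, SIGKILL, ... */
}

/* Poll a worker without blocking.
 * Returns 0 if it is still running or exited with status 0,
 *         1 if it failed, -1 if waitpid() itself failed. */
int worker_failed(pid_t pid)
{
    int status;
    pid_t r;

    assert(pid > 0);
    r = waitpid(pid, &status, WNOHANG);
    if (r < 0)
        return -1;
    if (r == 0)
        return 0;                         /* still running */
    return status_is_failure(status);
}
```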
-
- Socket State
- ------------
- Seems to me there's no clean way to extract the correct state of a
- socket without reading kernel memory. Nevertheless, the connection state
- must be retrieved for checking the state of the dispatcher. In a first
- test getpeername() was used. Unfortunately it returns the peername of
- the dispatcher even if that one has been killed (and the socket is in
- CLOSE_WAIT/FIN_WAIT_2 state).
- Up to version 1.0.1 select()'ing the socket for reading did the trick
- since it was used only as a one-way server->dispatcher connection. Thus
- being ready for reading meant an error.
- Later versions use the TCP connection to send the stdin (see Live
- Buffers above). Therefore checking the state of the connection means
- selecting it for read and testing it for emptiness at the same time.
- There's no UNIX syscall to do this. If it could be guaranteed that nobody
- reads the socket between selecting it for read and testing it for
- emptiness, we would succeed. Unfortunately there is a Live Buffer which
- reads the data written to the socket; this buffer is a separate process
- which does not sync itself with the server.
- The buffer cannot, however, block forever in its reading state. It will
- stop reading if the buffer on the dispatcher side is exhausted or
- killed. After that it will start writing on the pipe. Therefore we can
- disallow checking the dispatcher for life while the server buffer is in
- reading state. It enters reading state immediately when lPostBuffer() is
- called. By selecting the pipe for reading we can find out when the
- buffer is in its writing state.
- Note that the socket is never written to (by the dispatcher) unless an
- INIT request has been successfully completed by the server. Therefore we
- don't even have to check the socket for emptiness: selecting it for
- read whenever we can guarantee that the buffer is not reading it tells
- us whether the dispatcher is still running.
- If the buffer on the server side is killed before completing its reading
- then the server also terminates assuming the death of the dispatcher.
- This is ok. If the buffer dies during its writing period, the pipe is
- closed and returns eof for the reader which results in an error and exit
- there.
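-
- The resulting liveness test can be sketched in C with two zero-timeout
- select() calls. The function names, the "buffer_active" flag and the
- three-valued result are assumptions of this sketch, not the actual
- server code.

```c
#include <assert.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Non-blocking poll: 1 if fd is ready for reading (data or eof),
 * 0 if not, -1 on error. */
int readable_now(int fd)
{
    fd_set rfds;
    struct timeval tv = {0, 0};       /* do not wait at all */
    int r;

    assert(fd >= 0 && fd < FD_SETSIZE);
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    r = select(fd + 1, &rfds, NULL, NULL, &tv);
    if (r < 0)
        return -1;
    return r > 0;
}

/* Returns 1 if the dispatcher is presumed alive, 0 if it is presumed
 * dead, -1 if no verdict is possible right now.
 * buffer_active: a live buffer was posted on result_sock and may
 * still be reading from it; buffer_pipe is the pipe it writes to. */
int dispatcher_alive(int result_sock, int buffer_pipe, int buffer_active)
{
    int r;

    if (buffer_active && readable_now(buffer_pipe) != 1)
        return -1;   /* buffer has not reached its writing state */
    r = readable_now(result_sock);
    if (r < 0)
        return -1;
    return r ? 0 : 1;   /* readable here can only mean eof or error */
}
```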
-
- Rpc.inetrayd startup
- --------------------
- The server can be started up by inetd or inetray.start (or, for
- debugging purposes, by hand). It checks its number of arguments to
- decide how it was started up. If it is called without any arguments, it
- assumes that it is started by inetd. Therefore you have to supply a
- dummy argument if you want to start it by hand.
-